Take-home_Ex1: Geospatial Analytics for Public Good
Setting the Scene
The increasing digitization of urban infrastructures, such as public transportation and utilities, generates vast datasets using technologies like GPS and RFID. These datasets offer valuable insights into human movement patterns within a city, especially with the widespread deployment of smart cards and GPS devices in vehicles. However, current practices often limit the use of this data to basic tracking and mapping with GIS applications, primarily because conventional GIS lacks advanced functions for analyzing and modeling spatial and spatio-temporal data effectively. Enhanced analysis of these datasets could significantly contribute to better urban management and informed decision-making for both public and private urban transport services providers.
Objectives of Take-home_Ex1
Exploratory Spatial Data Analysis (ESDA) holds tremendous potential to address complex problems facing society. In this study, we are tasked to apply appropriate Local Indicators of Spatial Association (LISA) and Emerging Hot Spot Analysis (EHSA) to uncover the spatial and spatio-temporal mobility patterns of public bus passengers in Singapore.
The Data
Aspatial data
For this take-home exercise, Passenger Volume by Origin Destination Bus Stops downloaded from LTA DataMall will be used. It contains the number of trips by weekdays and weekends from origin to destination bus stops.
For this exercise, the data is collected in August 2023.
The fields are YEAR_MONTH, DAY_TYPE, TIME_PER_HOUR, PT_TYPE, ORIGIN_PT_CODE, DESTINATION_PT_CODE and TOTAL_TRIPS.
A sample row for the bus dataset is ‘2023-08, WEEKDAY, 16, BUS, 28299, 28009, 63’. A TIME_PER_HOUR value of 16 means the trips were recorded between 4pm and 5pm.
Geospatial data
Two geospatial data will be used in this study, they are:
Bus Stop Location from LTA DataMall’s static dataset. It provides information about all the bus stops currently being serviced by buses, including the bus stop code (identifier) and location coordinates.
hexagon, a hexagon layer of 250m (the perpendicular distance between the centre of a hexagon and its edges) will be used to replace the relatively coarse and irregular Master Plan 2019 Planning Sub-zone GIS data set of URA. Each spatial unit is regular in shape and finer in resolution.
The Task
The specific tasks of this take-home exercise are as follows:
Geovisualisation and Analysis
With reference to the time intervals provided in the table below, compute the passenger trips generated by origin at the hexagon level.

| Peak hour period | Bus tap on time |
|---|---|
| Weekday morning peak | 6am to 9am |
| Weekday afternoon peak | 5pm to 8pm |
| Weekend/holiday morning peak | 11am to 2pm |
| Weekend/holiday evening peak | 4pm to 7pm |

Display the geographical distribution of the passenger trips by using appropriate geovisualisation methods.
Describe the spatial patterns revealed by the geovisualisation (not more than 200 words per visual).
Local Indicators of Spatial Association (LISA) Analysis
Compute LISA of the passenger trips generated by origin at the hexagon level.
Display the LISA maps of the passenger trips generated by origin at the hexagon level. The maps should only display the significant values (i.e. p-value < 0.05).
With reference to the analysis results, draw statistical conclusions (not more than 200 words per visual).
Getting Started
In this exercise, the following libraries will be used:
sf package to perform geoprocessing tasks
sfdep package, which builds upon the spdep package (to compute spatial weights matrices and spatially lagged variables, for instance)
tmap to create geovisualisations
tidyverse, which supports data science, analysis and communication tasks, including creating static statistical graphs
knitr to create html tables
DT to create interactive html tables
ggplot2 to plot graphs
plotly to plot interactive graphs
Importing Data
Aspatial data
Import the Passenger Volume by Origin Destination Bus Stops dataset downloaded from the LTA DataMall by using read_csv() of the readr package.
Check the data fields.
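The import step might look like the sketch below (the file path and the object name odbus are assumptions; adjust them to where the downloaded CSV is stored):

```r
library(tidyverse)

# Import the OD passenger volume data (assumed path and file name)
odbus <- read_csv("data/aspatial/origin_destination_bus_202308.csv")

# Inspect the data fields
glimpse(odbus)
```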
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …
Processing the aspatial OD data
The ‘ORIGIN_PT_CODE’ and ‘DESTINATION_PT_CODE’ fields are of character type. We will convert them to factor data type.
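A minimal sketch of the conversion (assuming the imported tibble is named odbus):

```r
# Convert the bus stop code fields from character to factor
odbus <- odbus %>%
  mutate(ORIGIN_PT_CODE = as.factor(ORIGIN_PT_CODE),
         DESTINATION_PT_CODE = as.factor(DESTINATION_PT_CODE))
```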
The function below will extract origin data based on the four time intervals required by the task. The expected arguments are:
- daytype: ‘WEEKDAY’ or ‘WEEKENDS/HOLIDAY’
- timeinterval: e.g. c(8,10) if we want data from 8am to 11am.
The function will also compute the sum of all trips by ‘ORIGIN_PT_CODE’ for each time interval and store it under a new field called ‘TRIPS’.
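A sketch of get_origin, reconstructed from the description above (the dataframe name odbus is an assumption; both interval endpoints are treated as inclusive, matching the c(8,10) example):

```r
# Filter by day type and hour interval, then total trips per origin bus stop
get_origin <- function(daytype, timeinterval) {
  odbus %>%
    filter(DAY_TYPE == daytype,
           TIME_PER_HOUR >= timeinterval[1],
           TIME_PER_HOUR <= timeinterval[2]) %>%
    group_by(ORIGIN_PT_CODE) %>%
    summarise(TRIPS = sum(TOTAL_TRIPS)) %>%
    ungroup()
}

# e.g. weekday morning peak, 6am to 9am (hours 6, 7 and 8)
origin_day_am <- get_origin('WEEKDAY', c(6, 8))
```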
Let’s get the data using the get_origin function.
Take a look at an overview of all four dataframes.
Geospatial data
First, we will import the Bus Stop Location shapefiles into R. The output will be an sf point object with 5161 points and 3 fields. As the raw data is in the WGS84 geographical coordinate system, we will convert it to EPSG 3414, the projected coordinate system for Singapore.
busstop <- st_read(dsn="data/geospatial/BusStopLocation/BusStopLocation_Jul2023", layer = "BusStop") %>%
  st_transform(crs = 3414)

Reading layer `BusStop' from data source
`C:\yixin-neo\ISSS624_AGA\Take-home_Ex1\data\geospatial\BusStopLocation\BusStopLocation_Jul2023'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21 / Singapore TM
First 10 features:
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
1 22069 B06 OPP CEVA LOGISTICS POINT (13576.31 32883.65)
2 32071 B23 AFT TRACK 13 POINT (13228.59 44206.38)
3 44331 B01 BLK 239 POINT (21045.1 40242.08)
4 96081 B05 GRACE INDEPENDENT CH POINT (41603.76 35413.11)
5 11561 B05 BLK 166 POINT (24568.74 30391.85)
6 66191 B03 AFT CORFE PL POINT (30951.58 38079.61)
7 23389 B02A PEC LTD POINT (12476.9 32211.6)
8 54411 B02 BLK 527 POINT (30329.45 39373.92)
9 28531 B09 BLK 536 POINT (14993.31 36905.61)
10 96139 B01 BLK 148 POINT (41642.81 36513.9)
Checking for duplicates in the ‘BUS_STOP_N’ field reveals that there are 16 repeated bus stop numbers. However, they have different geometry points in the busstop simple feature object above. These could be temporary bus stops. We should retain all these rows as they might map to different hexagon ids later.
busstop %>%
st_drop_geometry() %>%
group_by(BUS_STOP_N) %>%
filter(n()>1) %>%
ungroup() %>%
  arrange(BUS_STOP_N)

# A tibble: 32 × 3
BUS_STOP_N BUS_ROOF_N LOC_DESC
<chr> <chr> <chr>
1 11009 B04 Ghim Moh Ter
2 11009 B04-TMNL GHIM MOH TER
3 22501 B02 Blk 662A
4 22501 B02 BLK 662A
5 43709 B06 BLK 644
6 43709 B06 BLK 644
7 47201 UNK <NA>
8 47201 NIL W'LANDS NTH STN
9 51071 B21 MACRITCHIE RESERVOIR
10 51071 B21 MACRITCHIE RESERVOIR
# ℹ 22 more rows
Take for instance bus stop number 51071 with two different point geometry values.
Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 28311.27 ymin: 36036.92 xmax: 28311.27 ymax: 36036.92
Projected CRS: SVY21 / Singapore TM
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
3470 51071 B21 MACRITCHIE RESERVOIR POINT (28311.27 36036.92)
Simple feature collection with 1 feature and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 28282.54 ymin: 36033.93 xmax: 28282.54 ymax: 36033.93
Projected CRS: SVY21 / Singapore TM
BUS_STOP_N BUS_ROOF_N LOC_DESC geometry
3472 51071 B21 MACRITCHIE RESERVOIR POINT (28282.54 36033.93)
Creating hexagon layer
Before we can plot the base layer, we have to create hexagonal grids of 250m from the busstop point data using sf.
First, create a grid whose extent equals the bounding box of the busstop points using st_make_grid().
- To create hexagons of 250m (centre to edge), we should input 500 for the ‘cellsize’ parameter, which is defined as the distance from edge to edge.
- Convert to an sf object and add a unique identifier to each hexagon grid. The output has 5580 hexagon units.
- Use st_intersects() to count the number of bus stops in each hexagon.
- Filter to retain only hexagons with at least one bus stop. The output bs_count has 1524 hexagon units.
area_hex_grid = st_make_grid(busstop,
cellsize= 500,
what = "polygons",
square = FALSE)
hex_grid_sf = st_sf(area_hex_grid) %>%
mutate(grid_id = 1:length(lengths(area_hex_grid)))
hex_grid_sf$bs = lengths(st_intersects(hex_grid_sf, busstop))
bs_count = filter(hex_grid_sf, bs > 0)
bs_count

Simple feature collection with 1524 features and 2 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3720.122 ymin: 26193.43 xmax: 48720.12 ymax: 53184.55
Projected CRS: SVY21 / Singapore TM
First 10 features:
area_hex_grid grid_id bs
1 POLYGON ((3970.122 27925.48... 34 1
2 POLYGON ((4220.122 28358.49... 65 1
3 POLYGON ((4470.122 30523.55... 99 1
4 POLYGON ((4720.122 28358.49... 127 1
5 POLYGON ((4720.122 30090.54... 129 2
6 POLYGON ((4720.122 30956.57... 130 1
7 POLYGON ((4720.122 31822.59... 131 1
8 POLYGON ((4970.122 28791.5,... 159 1
9 POLYGON ((4970.122 29657.53... 160 1
10 POLYGON ((4970.122 30523.55... 161 2
Visualisation of our hexagon base map.
We could save this hexagon layer to disk
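Saving might look like the sketch below (the output path and format are assumptions):

```r
# Write the hexagon base layer to disk; delete_dsn overwrites an existing file
st_write(bs_count, "data/geospatial/hexagon.geojson", delete_dsn = TRUE)
```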
Geospatial data wrangling
Combining busstop (point sf) and bs_count (polygon sf)
We need to get the corresponding bus stop number of each hexagon in bs_count hexagon layer.
The code chunk below performs a point and hexagon polygon overlap using st_intersection() and the output will be an sf point object.
Before overlapping:
busstop : 5161 points
bs_count hexagon base layer: 1524 hexagons
After overlapping:
busstop_hex : 5161 points. This is a good result because all the bus stop location data is mapped to our bs_count base map.
busstop_hex <- st_intersection(busstop, bs_count) %>%
select(BUS_STOP_N, BUS_ROOF_N, LOC_DESC, grid_id, bs)
busstop_hex

Simple feature collection with 5161 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21 / Singapore TM
First 10 features:
BUS_STOP_N BUS_ROOF_N LOC_DESC grid_id bs
3269 25059 UNK AFT TUAS STH BLVD 34 1
2570 25751 B02D BEF TUAS STH AVE 14 65 1
254 26379 NIL YONG NAM 99 1
2897 25761 B03 OPP REC S'PORE 127 1
2827 25719 B01C THE INDEX 129 2
4203 26389 NIL BEF TUAS STH AVE 5 129 2
2403 26369 NIL SEE HUP SENG 130 1
1565 26299 B13 BEF TUAS STH AVE 6 131 1
2829 25741 B03 HALLIBURTON 159 1
2828 25711 B02C OPP THE INDEX 160 1
geometry
3269 POINT (3970.122 28063.28)
2570 POINT (4427.939 28758.67)
254 POINT (4473.292 30681.57)
2897 POINT (4737.082 28621.29)
2827 POINT (4799.476 30158.46)
4203 POINT (4776.694 30324.88)
2403 POINT (4604.363 31212.96)
1565 POINT (4879.489 31948.93)
2829 POINT (5060.733 29212.4)
2828 POINT (4831.438 30132.43)
We will drop the geometry because busstop_hex is a POINT sf object; there is no hexagon polygon geometry for us to plot at the hexagon level. Furthermore, we have to process the attribute data. To get back the hexagon POLYGON geometry, we can always left_join() the bs_count sf dataframe with our attribute table again later.
We will also use datatable() function of the DT library to print the data. The table is interactive and can perform basic sorting.
busstop_hex <- busstop_hex %>%
st_drop_geometry()
datatable(busstop_hex, class = 'cell-border stripe', options = list(pageLength = 5))

Let us check for duplicates in the busstop_hex df as it will be used to perform a left join later. The output shows that there are 11 duplicated rows.
# A tibble: 22 × 5
BUS_STOP_N BUS_ROOF_N LOC_DESC grid_id bs
<chr> <chr> <chr> <int> <int>
1 22501 B02 Blk 662A 1251 8
2 22501 B02 BLK 662A 1251 8
3 43709 B06 BLK 644 1904 7
4 43709 B06 BLK 644 1904 7
5 47201 UNK <NA> 2381 2
6 47201 NIL W'LANDS NTH STN 2381 2
7 11009 B04 Ghim Moh Ter 2395 7
8 11009 B04-TMNL GHIM MOH TER 2395 7
9 58031 UNK OPP CANBERRA DR 2939 7
10 58031 UNK OPP CANBERRA DR 2939 7
# ℹ 12 more rows
Removal of duplicates
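A sketch of the removal, keeping the first row per (BUS_STOP_N, grid_id) pair so that genuinely distinct bus stop / hexagon combinations survive (the original chunk is hidden, so this is a reconstruction):

```r
# Keep one row per bus stop number and hexagon id combination
busstop_hex <- busstop_hex %>%
  distinct(BUS_STOP_N, grid_id, .keep_all = TRUE)
```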
Check again to be sure. All’s good.
[1] BUS_STOP_N BUS_ROOF_N LOC_DESC grid_id bs
<0 rows> (or 0-length row.names)
Note that there will still be repeated bus stop ids with different bus stop roof numbers / hexagon ids.
Combining each of the four aspatial dataframes with busstop_hex
The function below performs a left outer join of each of the four aspatial origin dataframes with the busstop_hex dataframe. The expected argument is:
asp_df: name of an aspatial origin dataframe such as origin_day_am or origin_day_pm.
The function will also rename the ‘ORIGIN_PT_CODE’ and ‘grid_id’ fields to ‘ORIGIN_BS’ and ‘ORIGIN_HEXFID’ respectively.
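The join function might be sketched as below (the function name get_origin_hex and the character coercion step are assumptions, reconstructed from the description above):

```r
# Left-join an origin dataframe to busstop_hex on the bus stop code
get_origin_hex <- function(asp_df) {
  asp_df %>%
    # factor back to character so the join key types match
    mutate(ORIGIN_PT_CODE = as.character(ORIGIN_PT_CODE)) %>%
    left_join(busstop_hex, by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
    rename(ORIGIN_BS = ORIGIN_PT_CODE,
           ORIGIN_HEXFID = grid_id)
}

origin_data_day_am <- get_origin_hex(origin_day_am)
```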
Use the left_join() function to get our dataframes containing grid_id and origin bus stop ids.
The number of rows increases by 4 to 5 after the left join. This is due to 5 bus stops, each with two distinct bus stop roofs located in different hexagons, that originated from the busstop and busstop_hex dataframes earlier.
Again, double check for duplicates
origin_data_day_am['ORIGIN_HEXFID', 'ORIGIN_BS'][duplicated(origin_data_day_am['ORIGIN_HEXFID', 'ORIGIN_BS']), ]
# A tibble: 0 × 1
# ℹ 1 variable: ORIGIN_BS <chr>
origin_data_day_pm['ORIGIN_HEXFID', 'ORIGIN_BS'][duplicated(origin_data_day_pm['ORIGIN_HEXFID', 'ORIGIN_BS']), ]
# A tibble: 0 × 1
# ℹ 1 variable: ORIGIN_BS <chr>
origin_data_end_am['ORIGIN_HEXFID', 'ORIGIN_BS'][duplicated(origin_data_end_am['ORIGIN_HEXFID', 'ORIGIN_BS']), ]
# A tibble: 0 × 1
# ℹ 1 variable: ORIGIN_BS <chr>
origin_data_end_pm['ORIGIN_HEXFID', 'ORIGIN_BS'][duplicated(origin_data_end_pm['ORIGIN_HEXFID', 'ORIGIN_BS']), ]
# A tibble: 0 × 1
# ℹ 1 variable: ORIGIN_BS <chr>
Bus stops with more than 1 bus roof number that exist in more than 1 hexagon are shown in the df below.
# A tibble: 5 × 1
ORIGIN_BS
<chr>
1 52059
2 53041
3 67421
4 68091
5 68099
Using skim() from the skimr package reveals that there are missing ‘ORIGIN_HEXFID’ values in our dataframes (see the summary table below).
After inspecting the structure of the skim output, the code chunk below:
- filters character or numeric fields
- filters for fields with missing values
- selects the required columns
- converts the output to a tibble df
- renames the column names (statically and dynamically)
- combines all the output in a summary tibble df.
The reason some bus stop ids have no hexagon id could be that the Bus Stop Location file is outdated (last updated in July 2023) while the Passenger Volume by Origin Destination Bus Stops dataset contains data from August 2023.
#str(skim(origin_data_day_am))
get_na_hex <- function(df, col2header) {
result <- skimr::skim(df) %>%
filter(skim_type == "character" | skim_type == "numeric") %>%
filter(n_missing > 0) %>%
select(skim_variable, n_missing)%>%
as_tibble() %>%
    rename_with(~col2header, .cols = n_missing) %>%
rename(Variable = skim_variable)
return(result)
}
r1 <- get_na_hex(origin_data_day_am, 'Wkday_morn')
r2 <- get_na_hex(origin_data_day_pm, 'Wkday_aft')
r3 <- get_na_hex(origin_data_end_am, 'Wkend_morn')
r4 <- get_na_hex(origin_data_end_pm, 'Wkend_aft')
bind_cols(r1,r2,r3,r4) %>%
select(1,2,4,6,8) %>%
  slice(3) %>% kable()

| Variable…1 | Wkday_morn | Wkday_aft | Wkend_morn | Wkend_aft |
|---|---|---|---|---|
| ORIGIN_HEXFID | 52 | 50 | 49 | 44 |
Bus stop ids without hexagon id
Since we are plotting the number of passenger trips generated at the hexagon level, we should aggregate the total trips by ‘ORIGIN_HEXFID’ and store these values in a new field called ‘TTRIPS’.
In addition, a new field ‘AVG_TRIPS’ is calculated, representing average daily weekday trips (TTRIPS / 5) or average daily weekend trips (TTRIPS / 2).
After the left join, the number of distinct hexagons for the four time intervals is 1491, 1489, 1499 and 1444 respectively. The total number of hexagons in bs_count was 1524. This suggests that some bus stops had no passengers during certain time intervals.
get_ttrips_wkday <- function(df) {
result <- df %>%
group_by(ORIGIN_HEXFID) %>%
summarise(
TTRIPS = sum(TRIPS),
AVG_TRIPS = ceiling(sum(TRIPS) / 5),
DESC = paste(LOC_DESC, collapse = ', ')
) %>%
ungroup()
return(result)
}
get_ttrips_wkend <- function(df) {
result <- df %>%
group_by(ORIGIN_HEXFID) %>%
summarise(
TTRIPS = sum(TRIPS),
AVG_TRIPS = ceiling(sum(TRIPS) / 2),
DESC = paste(LOC_DESC, collapse = ', ')
) %>%
ungroup()
return(result)
}
origin_data_day_am_hex <- get_ttrips_wkday(origin_data_day_am) # 5010 to 1491 rows
origin_data_day_pm_hex <- get_ttrips_wkday(origin_data_day_pm) # 4970 to 1489 rows
origin_data_end_am_hex <- get_ttrips_wkend(origin_data_end_am) # 4999 to 1499 rows
origin_data_end_pm_hex <- get_ttrips_wkend(origin_data_end_pm) # 4627 to 1444 rows

Print out the above four dataframes.
Retrieve hexagon geometry coordinates
In order to plot choropleth maps, we need the geometry data from the bs_count sf polygon object.
The function below performs a left join of bs_count with the attribute dataframes and also assigns a TTRIPS value of 0 to hexagons without any passengers.
get_hexgeo <- function(df) {
result <- left_join(bs_count, df,
by = c('grid_id' = 'ORIGIN_HEXFID')) %>%
mutate(TTRIPS = if_else(is.na(TTRIPS),0, TTRIPS),
AVG_TRIPS = if_else(is.na(AVG_TRIPS), 0, AVG_TRIPS))
return(result)
}
day_am <- get_hexgeo(origin_data_day_am_hex)
day_pm <- get_hexgeo(origin_data_day_pm_hex)
end_am <- get_hexgeo(origin_data_end_am_hex)
end_pm <- get_hexgeo(origin_data_end_pm_hex)

Task 1: Geovisualisation and Analysis
In this section, choropleth maps will be used to show the geographical distribution of the passenger trips by origin. However, we need to plan the classification for the maps. We could check the distribution of the ‘TTRIPS’ field across ALL four time intervals.
First, combine all the ‘TTRIPS’ and create a new column ‘source’ to retain the name of the time interval.
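A sketch of the combining step, assuming the four hexagon-level dataframes from the aggregation section (the source labels are assumptions):

```r
# Stack the four dataframes, tagging each row with its time interval
combine <- bind_rows(
  origin_data_day_am_hex %>% mutate(source = 'Weekday morning'),
  origin_data_day_pm_hex %>% mutate(source = 'Weekday afternoon'),
  origin_data_end_am_hex %>% mutate(source = 'Weekend morning'),
  origin_data_end_pm_hex %>% mutate(source = 'Weekend afternoon')
)
```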
Plot boxplots to get a general sensing of the distribution of ‘TTRIPS’ across different time intervals.
ggplotly(ggplot(data=combine,
aes(y=AVG_TRIPS,
x=source)) +
geom_boxplot() +
geom_point(stat="summary",
                       fun="mean",
colour ="red",
size=2) +
labs(x = "Time Intervals", y = "Total Trips", title='Daily average origin trips by time intervals') +
theme_minimal() +
theme(legend.key.size = unit(0.5,'cm'),
legend.position="bottom",
plot.title = element_text(size = 12,
face='bold'),
axis.title = element_text(size = 11 , face = "bold"),
axis.text = element_text(size = 10),
            axis.text.x = element_text(angle = 0, hjust = 1)))

From the chart above, both the weekday ridership and the number of bus stops utilised far outweigh those of the weekend period.
Final check on the summary statistics of TTRIPS before setting custom break points.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 83.5 542.0 1820.2 2116.5 67412.0
Because of the large number of outlier bus stops, let us calculate the upper bound of the outliers.
Q1 <- summary(combine$AVG_TRIPS)[2]
Q3 <- summary(combine$AVG_TRIPS)[5]
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
#lower_bound
upper_bound

3rd Qu.
   5166
With reference to the box plot, summary statistics and upper bound values:
- Min: 1, Max: 67,412
- quartile break points can be 83, 540, 2110
- the break vector is therefore c(0, 83, 540, 2110, 5160, 50000, 67412)
The last two intervals distinguish the outliers and extreme outliers (bus stops with very high daily passenger numbers) on the map later.
plotmap <- function(df, title) {
df2<- df %>% top_n(10, AVG_TRIPS)
result <- tm_shape(df)+
tm_fill("AVG_TRIPS",
          breaks = c(0, 83, 540, 2110, 5160, 50000, 67412), #style = "quantile",
palette = "Blues",
alpha= 0.5,
#legend.hist = TRUE,
#legend.is.portrait = TRUE,
#legend.hist.z = 0.3,
title = "Passengers Trip") +
tm_layout(main.title = title,
main.title.position = "center",
main.title.size = 1.2,
legend.height = 0.45,
legend.width = 0.35,
#legend.outside = TRUE,
#legend.text.size= 0.6,
#inner.margins = c(0.01, 0.01, 0, .15),
#legend.position = c("right", "top"),
#bg.color = "black",
#main.title.color = 'white',
#legend.title.color = 'white',
#legend.text.color= 'white',
bg.color = "mintcream",
frame = TRUE) +
tm_borders(alpha = 0.5) +
tm_compass(type="8star", size = 2) +
tm_scale_bar() +
tm_grid(alpha =0.2) +
tm_credits("Source: Passenger Bus origin and destination data and Bus Stop Location data from LTA datamall",
position = c("left", "bottom")) +
tm_shape(df2) +
tm_bubbles(size='AVG_TRIPS', col='red',alpha=0.5, scale=.6)
return(result)
}

p1 <- plotmap(day_am, 'Weekday-Morning (Daily Average) \nPassenger Trips generated at Hex lvl')
p2 <- plotmap(day_pm, 'Weekday-Afternoon (Daily Average) \nPassenger Trips generated at Hex lvl')
p3 <- plotmap(end_am, 'Weekend-Morning (Daily Average) \nPassenger Trips generated at Hex lvl')
p4 <- plotmap(end_pm, 'Weekend-Afternoon (Daily Average) \nPassenger Trips generated at Hex lvl')

Map of Weekday-Morning (Daily Average) Passenger Trips generated at Hex lvl
Map of Weekday-Afternoon (Daily Average) Passenger Trips generated at Hex lvl
Map of Weekend-Morning (Daily Average) Passenger Trips generated at Hex lvl
Map of Weekend-Afternoon (Daily Average) Passenger Trips generated at Hex lvl
Facet View
Let us plot all four maps side-by-side for easy comparison.
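The side-by-side layout can be produced with tmap_arrange() of the tmap package (the 2-column layout is an assumption):

```r
# Arrange the four choropleth maps in a 2 x 2 grid
tmap_arrange(p1, p2, p3, p4, ncol = 2)
```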
Discussion
The top row shows the average daily weekday morning and afternoon passenger trips while the bottom row shows the weekend morning and afternoon ridership generated at the hexagon level. The classification method used is manual; the first three break points represent the 1st, 2nd and 3rd quartiles, and the last two intervals capture the outliers and extreme outliers. In addition, the red bubbles represent the hexagons with the top 10 average daily passenger trips in each time interval. We can observe the following:
Average daily ridership generation is greatest on weekday mornings, followed by weekday afternoons and weekend mornings (these two cannot be distinguished at one glance). Lastly, weekend afternoons register the least ridership volume.
At first glance, the top two hexagons with the highest origin ridership volume are hex id 1251 and hex id 2411. They are consistently the top 2 origin hexagons across all four time intervals. The bus stops (refer to the interactive data table above) that fall within these hexagons are:
hex id 1251: BOON LAY INT, BLK 662D, OPP BLK 662C, Blk 662A, BLK 664C
hex id 2411: WOODLANDS REG INT, W’LANDS CIVIC CTR, OPP W’LANDS CIVIC CTR, BEF W’LANDS STN EXIT 7, W’LANDS STN EXIT 5, W’LANDS STN EXIT 4, BLK 894C, BLK 515
Other hexagon ids that are consistently in the top 10 origin hexagons for all four time intervals are:
Hex id 3239: ANG MO KIO INT, BLK 322, AFT ANG MO KIO STN EXIT A, BLK 422, AFT ANG MO KIO INT, ANG MO KIO STN
Hex id 4539: TAMPINES INT, OUR TAMPINES HUB, OPP OUR TAMPINES HUB, UOB TAMPINES CTR, OPP CENTURY SQ, Tampines Stn Exit D, Aft Tampines Stn Exit E
Hex id 4349: BEDOK BUS INT, BEDOK STN EXIT B, BEDOK STN EXIT A, OPP PANASONIC, PANASONIC
The two hexagons that fall within the top 10 on weekdays but drop out of the top 10 during weekends are:
hex id 909: BEF JOO KOON INT, OPP JOO KOON INT, MOLEX, OPP MOLEX, OPP FAIRPRICE HUB, JOO KOON INT
hex id 2054: CLEMENTI STN, CLEMENTI STN, BLK 431
Interestingly, hex id 3204 is within the top 10 on weekday and weekend mornings but falls out of this top 10 position on weekday and weekend afternoons, suggesting that this region is busier in the morning than in the afternoon.
- hex id 3204: TOA PAYOH BUS INT, OPP TOA PAYOH STN, TOA PAYOH STN, BLK 138B, BLK 84B, BLK 79C
Across all four time intervals, there appear to be clusters of high-high and low-low values as well as outliers of low-high and high-low values in our study area. However, we cannot be sure until we run the local Moran’s I statistical test.
Task 2: Local Indicators of Spatial Association (LISA) Analysis
In this section, we will use LISA statistics to identify the existence of clusters or outliers for a given variable, in our case, the total number of trips generated at the hexagon level. We will apply the local Moran’s I statistic to detect clusters and / or outliers.
Computing Distance-Based Spatial Weights Matrix
Before we can compute the global/local spatial autocorrelation statistics, we need to construct a spatial weights matrix of our study area. The spatial weights matrix defines which spatial units (hexagons) in our study area are neighbours of each other. There are a few general rules:
Every feature should have at least one neighbour, and no feature should be a neighbour of all other features.
If the data is skewed, each feature should have at least 8 neighbours (let’s try with 6 first).
We will rule out the contiguity methods because, as seen in the geovisualisation above, bus stops are rather sparse in some regions such as the central catchment areas, military training areas and airports, resulting in gaps between groups of hexagons. Therefore, we are more inclined towards distance-based methods.
There are two main types of distance-based methods, namely:
Adaptive distance-based (a fixed number of neighbours, e.g. 6 or 12)
This is our choice as our data is highly skewed to the right. Each hexagon is guaranteed at least n neighbours, which is useful when testing for statistical significance later.
Fixed-distance threshold
This method would also have been reasonable, since our spatial units are all hexagons of the same size, but we opted for the adaptive approach instead.
Identify adaptive distance weights
We will set each hexagon to have 6 neighbours using the code chunk below.
- st_knn() of sfdep will output a list of neighbours ‘nb’ for each hexagon
- based on the ‘nb’ column, st_weights() of sfdep will generate row-standardised spatial weights
- ‘.before = 1’ will put the two columns at the front of our sf dataframe.
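The steps above might be sketched as follows for the weekday-morning layer (using polygon centroids for the k-nearest-neighbour search is an assumption, not necessarily the author's exact call):

```r
# Build an adaptive distance-based weights table with 6 neighbours per hexagon
wm_day_am <- day_am %>%
  mutate(nb = st_knn(st_centroid(st_geometry(.)), k = 6),
         wt = st_weights(nb),   # row-standardised by default (style = "W")
         .before = 1)
```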
Taking a sneak peek at one of the spatial weights matrices above:
| nb | wt | grid_id | bs | TTRIPS | AVG_TRIPS | DESC | area_hex_grid |
|---|---|---|---|---|---|---|---|
| 2, 3, 4, 5, 6, 8, 9, 10, 12, 13, 16, 17, 22, 23, 24, 30, 38, 39 | 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556 | 34 | 1 | 61 | 13 | AFT TUAS STH BLVD | POLYGON ((3970.122 27925.48… |
| 1, 3, 4, 5, 6, 8, 9, 10, 12, 13, 16, 17, 22, 23, 24, 30, 38, 39 | 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556 | 65 | 1 | 43 | 9 | BEF TUAS STH AVE 14 | POLYGON ((4220.122 28358.49… |
| 5, 6, 7, 9, 10, 12, 13, 14, 16, 17, 18, 23, 24, 25, 30, 31, 32, 40 | 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556 | 99 | 1 | 43 | 9 | YONG NAM | POLYGON ((4470.122 30523.55… |
| 1, 2, 3, 5, 8, 9, 10, 12, 13, 16, 17, 22, 23, 24, 30, 38, 39, 40 | 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556 | 127 | 1 | 135 | 27 | OPP REC S’PORE | POLYGON ((4720.122 28358.49… |
| 3, 6, 8, 9, 10, 12, 13, 14, 16, 17, 18, 23, 24, 25, 30, 31, 39, 40 | 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556 | 129 | 2 | 1119 | 224 | THE INDEX, BEF TUAS STH AVE 5 | POLYGON ((4720.122 30090.54… |
| 3, 5, 7, 9, 10, 11, 13, 14, 16, 17, 18, 19, 24, 25, 26, 31, 32, 41 | 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556, 0.05555556 | 130 | 1 | 63 | 13 | SEE HUP SENG | POLYGON ((4720.122 30956.57… |
Global autocorrelation of spatial association (Global Moran’s I with simulation)
Global spatial association assesses the overall spatial pattern of a variable ‘TTRIPS’ across the entire study area. It provides a single value or metric that summarizes the extent to which similar values cluster together or are dispersed across the entire geographic space.

Set seed to ensure that the computation is reproducible.
Global Moran’s I for weekday morning passenger trips generated at the hexagon level. A simulation approach is used as we do not assume normality or randomisation. The number of simulations is 99 + 1 = 100.
We can use the global_moran_perm() function of the sfdep package to do this. It all starts with the wm_day_am spatial weights matrix.
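A sketch of the call, assuming the weights table wm_day_am holds the nb, wt and TTRIPS columns as built above (the seed value is an assumption):

```r
set.seed(1234)  # assumed seed; any fixed value makes the permutations reproducible

# Monte-Carlo (permutation) test of global Moran's I with 99 simulations
global_moran_perm(wm_day_am$TTRIPS,
                  wm_day_am$nb,
                  wm_day_am$wt,
                  nsim = 99)
```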
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.17009, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Global Moran’s I for weekday afternoon passenger trips generated at the hexagon level.
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.18789, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Global Moran’s I for weekend morning passenger trips generated at the hexagon level.
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.13575, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
Global Moran’s I for weekend afternoon passenger trips generated at the hexagon level.
Monte-Carlo simulation of Moran I
data: x
weights: listw
number of simulations + 1: 100
statistic = 0.17146, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided
For all four time intervals, since the p-value for global Moran’s I is smaller than 0.05, we can reject the null hypothesis that the spatial pattern is random. Because the Moran’s I statistic is greater than 0, we can infer that the spatial distribution shows signs of clustering for all four time intervals. This result is rather consistent with the choropleth maps plotted earlier.
Local autocorrelation of spatial association
Local spatial association examines the spatial patterns at a more detailed, local level. Instead of providing a single summary value for the entire study area, local measures identify specific areas where the spatial association is particularly strong or weak. If the Local Moran’s I is positive, it suggests clusters of high-high or low-low. If negative, it suggests outliers of low-high or high-low.

Local Moran’s I for weekday morning passenger trips generated at the hexagon level.
We can use the local_moran() function of the sfdep package to do so. It all starts with the wm_day_am spatial weights matrix; the function computes the local statistics from the neighbour list and weights using conditional permutation. As this function outputs the results in a nested format, we will need unnest() in order to access the output.
LISA of the passenger trips generated by origin at the hexagon level.
New columns are added to the lisa_day_am simple feature dataframe. The new columns are:
ii: local Moran statistic
eii: expectation of the local Moran statistic; for localmoran_perm, the permutation sample means
var_ii: variance of the local Moran statistic; for localmoran_perm, the permutation sample standard deviations
z_ii: standard deviate of the local Moran statistic; for localmoran_perm, based on the permutation sample means and standard deviations
p_ii: p-value of the local Moran statistic using pnorm(); for localmoran_perm, using standard deviates based on the permutation sample means and standard deviations
p_ii_sim: for localmoran_perm(), rank() and punif() of the observed statistic rank for [0, 1] p-values
p_folded_sim: the simulation folded [0, 0.5] range ranked p-value
skewness: for localmoran_perm, the output of e1071::skewness() for the permutation samples underlying the standard deviates
kurtosis: for localmoran_perm, the output of e1071::kurtosis() for the permutation samples underlying the standard deviates
mean: contains the quadrant ‘type’ of a typical Moran scatterplot
median: contains the quadrant ‘type’ of a typical Moran scatterplot (use this if the dataset is highly skewed).
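A sketch of how lisa_day_am might be derived, assuming the wm_day_am weights table from the earlier step (the nsim value is an assumption):

```r
# Compute local Moran's I per hexagon, then flatten the nested result columns
lisa_day_am <- wm_day_am %>%
  mutate(local_moran = local_moran(TTRIPS, nb, wt, nsim = 99),
         .before = 1) %>%
  tidyr::unnest(local_moran)
```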
Visualising significant Local Moran’s I at 95% confidence level
We will use tmap core functions to build choropleth maps, using the Local Moran’s I field and p-value field for all four time intervals.
Only the significant values of Local Moran’s I values at 95% confidence level are plotted.
get_sig_lmi_map <- function(lisatable, title) {
sig_lisatable <- lisatable %>%
filter(p_ii_sim < 0.05)
result <- tm_shape(lisatable) +
tm_polygons() +
tm_borders(alpha = 0.5) +
tm_shape(sig_lisatable) +
tm_fill("ii") +
tm_borders(alpha = 0.4) +
tm_layout(main.title = title,
main.title.size = 1.3)
return(result)
}
sig_lmi_1 <- get_sig_lmi_map(lisa_day_am,"Local Moran's I of Total Trips generated on Weekday Morning at 95% CL" )
sig_lmi_2 <- get_sig_lmi_map(lisa_day_pm, "Local Moran's I of Total Trips generated on Weekday Afternoon at 95% CL" )
sig_lmi_3 <- get_sig_lmi_map(lisa_end_am, "Local Moran's I of Total Trips generated on Weekend Morning at 95% CL" )
sig_lmi_4 <- get_sig_lmi_map(lisa_end_pm, "Local Moran's I of Total Trips generated on Weekend Afternoon at 95% CL" )

From the local Moran choropleth maps above, orange regions indicate outliers; however, we would not know whether they are low-high or high-low regions.
The green regions indicate clusters; however, we would not know whether they are high-high or low-low regions.
To find out, we have to plot the LISA maps in the next section.
Visualizing significant LISA map at 95% confidence level
The LISA maps that we are building now are categorical maps showing outliers (low-high or high-low) and clusters (high-high or low-low).
We will use median values to generate the quadrants (HH, LH, HL or LL) because our data is highly skewed to the right, otherwise we can use the mean values.
get_sig_lisa_map <- function(lisatable, title) {
sig_lisatable <- lisatable %>%
filter(p_ii_sim < 0.05)
result <- tm_shape(lisatable) +
tm_polygons(alpha = 0.5) +
tm_borders(alpha = 0.5) +
tm_shape(sig_lisatable) +
tm_fill("median",
palette = c("#2c7bb6", "#fdae61", "#abd9e9", "#d7191c"),
alpha= 0.7) +
tm_dots('grid_id', alpha=0.05) +
tm_borders(alpha = 0.4) +
tm_layout(main.title = title,
main.title.size = 1.5,
legend.position = c("left", "top"))
return(result)
}
sig_lisa_1 <- get_sig_lisa_map(lisa_day_am,"LISA categories generated on Weekday Morning at 95% CL" )
sig_lisa_2 <- get_sig_lisa_map(lisa_day_pm, "LISA categories generated on Weekday Afternoon at 95% CL" )
sig_lisa_3 <- get_sig_lisa_map(lisa_end_am, "LISA categories generated on Weekend Morning at 95% CL" )
sig_lisa_4 <- get_sig_lisa_map(lisa_end_pm, "LISA categories generated on Weekend Afternoon at 95% CL" )

LISA categories generated on Weekday Morning at 95% CL
To leverage the interactivity of the map, clicking on the centre of the significant hexagons reveals their grid id for easy identification of the bus stops that fall within each hexagon. (Use the interactive datatable above to search for the hexagon grid id.)

